Sparse essential interactions in model networks of gene regulation 
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Gene regulatory networks typically have low in-degrees, whereby any given gene is regulated by 
few of the genes in the network. What mechanisms might be responsible for these low in-degrees? 
Starting with an accepted framework of the binding of transcription factors to DNA, we consider 
a simple model of gene regulatory dynamics. In this model, we show that the constraint of having 
a given function leads to the emergence of minimum connectivities compatible with function. We 
exhibit mathematically this behavior within a limit of our model and show that it also arises in 
the full model. As a consequence, functionality in these gene networks is parsimonious, i.e., is 
concentrated on a sparse number of interactions as measured for instance by their essentiality. Our 
model thus provides a simple mechanism for the emergence of sparse regulatory networks, and leads 
to very heterogeneous effects of mutations. 



PACS numbers: 

I. INTRODUCTION 

During the last decade, genomic studies have revealed 
that complex organisms typically do not have many more 
genes than less complex ones. Because of this, the 
paradigm for thinking about biological complexity has 
shifted from the number of genes to the way they may 
work together: higher complexities might be associated 
with a greater proportion of regulatory genes. In par- 
ticular, there are strong indications in eukaryotes and 
prokaryotes that for increasing genome size the number 
of regulatory genes grows faster than linearly in the to- 
tal number of genes [J Q. Hence it is appropriate to 
consider biological complexity in the framework of in- 
teraction networks. This shift from components to the 
associated interactions has received increasing attention 
in many scientific communities, with applications rang- 
ing from network biology to sociology. The relevance of 
this conceptual framework for biology has been repeat- 
edly emphasized (see, for example the review 0]) and has 
benefited from inputs from other fields and from statis- 
tical mechanics in particular 0] . We will therefore freely 
use the network terminology, refering to nodes, their de- 
grees, distinguishing between in and out degrees etc. 

From studies that strive to unravel gene regulatory net- 
works (GRN), several qualitative properties transpire: (i) 
a given gene is generally influenced by a "small" number 
of other genes (low in-degree of the network of interac- 
tions when compared to the largest possible degree); (ii) 
some genes are very pleiotropic (the out-degree of some 
nodes of the network can be high); (iii) GRN seem to be 
robust to change (e.g. to environmental fluctuations or 
to mutations), a feature that is also found at many other 



levels of biological organisation 

[IS Hi- A simple way 
to build robustness into a network is to have rather dense 
connections, effectively incorporating redundancy in a lo- 
cal or global way. Furthermore, the number of networks 
having m interactions grows very quickly with m. Thus 
when modeling GRN, the network realizations that per- 
form a given regulatory function are dominantly of very 
high degree. However this is not the case experimentally, 
at least with respect to the in-degree, and so models so 
far have had to build in limitations to the accessible con- 
nectivities [{| [l(J EH- In this work we show that such 
shortcomings of models can be overcome by taking into 
account the known mechanisms underlying genetic in- 
teractions: gene regulation is mediated via the molecu- 
lar recognition of DNA motifs by transcription factors, 
and this leads to biophysical constraints on interaction 
strengths. Within this relatively realistic framework, we 
shall see that networks are in fact driven to be parsi- 
monious (the essential interactions are sparse) for the 
in-degree while the out-degree is unconstrained. 

We begin by explaining the mechanisms incorporated 
into the model, in particular the determinants of the in- 
teractions. We follow standard practice [II, [III, Q3] when 
modeling interactions between DNA binding sites and 
transcription factors: the affinity is taken to depend on 
the mismatch between two character strings. We also 
specify how gene expression dynamics depend on these 
interactions and what "function" the networks must im- 
plement to be considered viable. 

Before proceeding, it is perhaps useful to point out 
briefly the main similarities and differences between the 
approach adopted in this work and in previous litera- 
ture. Our model belongs to the class known as "threshold 



2 



models", used widely in describing neural networks [15| 
and more recently in GRN modeling [jq j. Within such 
a framework, one represents the GRN by its matrix of 
connections, and mutations correspond to random mod- 
ifications of this matrix. In our approach, mutations 
are also random of course, but we mimic the underlying 
microscopic effects of a mutation and this forces us to 
work with weighted interactions; this more realistic way 
of treating mutations has rather striking consequences as 
we shall see. Note that we focus on generic aspects of 
the problem, without attempting to reproduce specific 
experimental data. 

After setting the general framework, we present some 
of the mathematical and numerical tools we use to ana- 
lyze the model. Results are first given for the full model, 
derived using computational methods. Then we focus 
on a limiting case of the model for which a mathemati- 
cal analysis can be pushed rather far. We demonstrate 
there that the constraint of having a given function makes 
the networks be marginally "viable" and that the corre- 
sponding connectivity is in a sense minimal, i.e., net- 
works are as sparse as they can be subject to maintain- 
ing their function. This same principle applies to the 
full model and remarkably, the simple limit proves to be 
an excellent approximation. Along with the spontaneous 
appearance of sparseness, we find that the network in- 
teractions are quite robust to change [1, [TtJ : only those 
few binding sites that are "effectively" used are fragile, 
mutations of the other (little used) binding sites have al- 
most no effect. Thus robustness to mutational changes 
is very high for most binding sites while the "essential" 
interactions have much lower robustness; robustness is 
heterogeneously distributed in the network. Implications 
of this sparseness are developed in the discussion; in par- 
ticular, a consequence is that redundant interactions are 
rapidly eliminated under evolution if no new function 
arises which might change the selection pressure. 



II. FRAMEWORK: MODEL OF INTERACTING 
GENES 

Gene expression dynamics and viability 

Our framework is an abstract GRN model belonging 
to a family of models that has been used many times 
by different authors [jj, QJ, GJ, So|, SJ. We consider 
N genes whose products can have regulatory influences 
on the same set of genes (retroaction) and possibly also 
have some "down-stream" consequences. However the 
consequences of these last effects can be ignored for our 
purposes since they lead to no feedback on the N "core" 
genes. Call Sj(t) the expression level of gene j at time t, 
in practice thought of as the concentration of the tran- 
scription factor [22j it produces. The dynamics of the 
Sj(t) takes place on biochemical time scales (typically 
minutes). To model that, we keep the spirit of earlier 
work 0, QjJ [2(| HH, taking the genes to be either on 



(Sj = 1) or off (Sj = 0). Furthermore, these expression 
levels are updated synchronously at discrete time steps: 
to go from time step fc to k + 1, we take 

JV 

S<* +1) =JT(£w^Sj fc) -fc). (1) 
i=i 

Here, h is a threshold and H is the Heaviside function, 
H(x) = for x < and H{x) = 1 for x > 0, while W tj 
denotes the strength of the interaction that gene j has on 
gene i. A priori, the Wij can be arbitrary and we have 
no built-in restriction on the network's connectivity. 

The "integrate and fire" functional form of Eq. [TJ is 
inspired by that arising in perceptrons [la ]. The other 
main family of models that have been used for modeling 
gene expression dynamics involve random boolean func- 
tions of the inputs [j| Since that framework does 
not provide a central role for weighted interactions, we 
have not considered here the use of this second family of 
models. 

Eq. [Tj defines a deterministic discrete dynamical sys- 
tem. After possible transient behavior, at large k the 
set of expression levels {5^ }j=i jy will either go to 
a (time-independent) steady state or will go into a cy- 
cle (periodic behavior). Which case arises may depend 
on the initial state of the expression levels. Following 
the motivation [16I | coming from early embryo devel- 
opment, we consider given the initial expression levels, 

{S^ in }i=i,...at. Furthermore, a network will "perform 
the desired function" if and only if, starting with the ini- 
tial expression pattern, it will lead to the desired steady- 
state gene expression levels; these must also be given a 

priori and correspond to the "target" {S'f torffe *' l };=i,...Ar. 
(More complicated choices, such as limit cycles, would 
also be possible.) Hereafter we say that a network is "vi- 
able" if it statisfies this functional property. Note that 
in our model, all genes are on an equal footing; it is then 
easy to see that the model's properties depend not on 
the details of the initial and target patterns, but only on 
the number of indices i where S^^ = sf ar9et ^ = and 
1. We shall show results when these numbers are set to 
their average values if each pattern is taken at random, 
but the results are not sensitive to this choice. 



Microscopic modeling of the interactions 

So far the framework is rather abstract, the interac- 
tions Wij are arbitrary. However much is known about 
how interactions are mediated in reality, and so it is ap- 
propriate to include this knowledge to obtain more realis- 
tic models. To begin, the product of gene j is a transcrip- 
tion factor, hereafter denoted TFj, i.e., a protein which 
modulates the rate at which other genes are transcribed. 
This modulation arises from the binding of TFj to the 
regulatory region of other genes (cf. Fig. [JJ. In the ab- 
sence of any bound transcription factors in its regulatory 
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region, gene i's level of transcription will be low (consid- 
ered here as off, Si — 0). We allow all of the N types of 
transcription factors to access all the regulatory regions, 
but binding depends on the affinity between the TF and 
the DNA content of these regions. To keep the model 
simple, we consider that each gene's regulatory region 
consists of N putative binding sites, one for each of the 
N types of TFs as illustrated in Fig. [TJ If gene j is "on" , 
it will produce a certain number n of TF molecules of 
type j; if it is off, it produces no TFs. We shall consider 
different values of n in our study, using the biologically 
relevant range 100 < n < I0 4 . (The lower value comes 
from the multiplicity of transcripts in E. coli [24| and 
the expected numbers of protein copies produced thereof, 
while the upper value comes from direct measurements 
of numbers of transcription factor molecules [HI].) 

To model the affinity between TFs and binding sites, 
we follow standard practice and represent each TF and 
binding site by a character string using a 4 letter al- 
phabet. The binding free-energy is then simply propor- 
tional to the mismatch between the two chains jl2l . [l3l . 
ItH . I26L I27I ]. This leads to the inverse Boltzmann fac- 
tor fiij « C e edij , for which Gerland et al. [3 have 
shown that the constant C is close to I and thus will be 
dropped hereafter. In this formula, dij is the Hamming 
distance (number of mismatches) between TFj and the 
jth binding site of gene i; furthermore, e is the penalty for 
each mismatch (contribution to binding energy in units 
of ksT). Experimentally, e is inferred to have a value be- 
tween one and three if we think of each base pair of the 
DNA as being represented by one character [2^, [29|, [3(| • 
The number L of characters used to represent a TF or its 
binding site is set using the typical number of base pairs 
in experimentally studied binding sites, 1 < L < 15. 

When Sj = 1, there are n TF molecules of type j that 
can bind to the jth site of gene i's regulatory region; 
given that this site can be occupied only by one TF at 
a time, it is common practice to take this occupation 
probability to be (l4| : 

Pa = t , > / ■ (2) 

1 + riij/n 

For our purposes, we simplify this relation by working in 
the regime of low competition as follows. First we define 
our Wij to be proportional to the Boltzmann factor: 

W ij = e- Ed » i,j = l,...,N. (3) 

When n times YjjZi WijSj is large enough, following 
Eq. [2], there is a high probability that at least one of 
the binding sites in gene i's regulatory region will be 
occupied [3J|. We thus set the threshold in Eq. [T] at 
a value h which is inversely proportional to the number 
n of TF molecules, h = 1/n. This parameter h plays a 
central role in the model so we shall investigate how its 
value influences the behavior of the network. 
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FIG. 1: Schematic representation of the regulatory region of 
gene i: there are N binding sites, each labeled by an index 
j (1 < j < N). Represented is the interaction Wij mediated 
by the binding of TF j to the j'th site of that region. The 
binding affinities depend on the mismatch between the string 
of length L representing the TF and that representing the 
DNA of the corresponding binding site. 



III. METHODS FOR MODEL ANALYSIS 
Uniform sampling of viable genotypes 

One can think of a GRN's genes and DNA binding sites 
as specifying its "genotype" ; equivalently, the genotype 
can be thought of as being given by the list of weights 
Wij, corresponding to a weighted oriented graph. Be- 
cause TF are typically pleiotropic, they are generally 
thought to evolve slowly, while DNA regulatory regions 
typically have a high level of polymorphism and may 
evolve more quickly. Thus in all our study we shall con- 
sider that the genes (and thus the TF they code) are fixed 
whereas the strings of characters representing the DNA 
binding sites are unconstrained. This defines the scope 
of the genotype space of our model. 

Now within this genotype space lies a small subset of 
viable genotypes, i.e., thoses genotypes which lead to 
the correct target expression levels given {Sj * }i=i,...N 
as starting point. These viable GRN have the desired 
function or "phenotype" ; note that the mapping from 
genotype to phenotype is generally many to one. 

It is relatively straightforward to sample uniformly all 
genotypes since they are specified by character strings. 
However, only a tiny part of this space corresponds to 
viable genotypes, the kind we are concerned with. To 
sample this much smaller space, we rely on Monte Carlo 
Markov Chains (MCMC). Within such a procedure, we 
start by producing (if necessary by design) a viable GRN; 
then we perform a random walk in the viable subspace of 
genotypes: at each step we propose a small change of the 
characters in one of the binding sites; if the new geno- 
type is viable, we accept it, otherwise we reject it and 
stay with the current genotype. From this procedure, we 
sample uniformly the space of viable genotypes, which al- 
lows us to generate many random viable networks; from 
these unbiased samplings we examine the statistical prop- 
erties imposed by viability. Properties include network 
sparseness, robustness to mutations and essentiality of 
interactions. Furthermore, the results will depend on the 
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"specificity" [H,Hlj,[3l| of the interactions between genes 
(through the alphabet size and the length of the character 
strings used in our matching process); this aspect plays 
an important role in understanding how such networks 
can both function and evolve, so we will consider how 
our results depend on it. 

Model implementation 

The DNA sequences of one of our networks of N genes 
can be represented via TV 2 strings of L characters in a 
four letter alphabet; indeed, a binding site for a given 
TF is represented by L characters, there are N binding 
sites per gene and a total of N genes. In our framework, 
each TF is considered as given while the space of all con- 
sidered networks arises from letting the DNA sequences 
be variable. Thus, instead of having an explicit repre- 
sentation of the strings associated with TFs and binding 
sites, it is enough to track a binary string of length L for 
each binding site, where for each entry the bit (respec- 
tively 1) stands for a mismatch (respectively a match) 
with the corresponding TF. The genotype then reduces 
to N 2 such binary strings. It is important to remember 
that there are 3 underlying possible characters for each 
bit at and only one if the bit is 1. When using this rep- 
resentation for our MCMC, the transition rate from a 
to a 1 bit must be three times smaller than the transition 
rate from a 1 to a bit. 

To ensure that our MCMC is ergodic on the time scales 
accessible to our computational ressources, we use a swap 
operator whereby we exchange the content of two ran- 
domly chosen neighbor binding sites. We call "step oper- 
ation" the following: (1) propose successively L random 
point mutations; (2) propose a single swap. A "sweep" is 
then the application of N 2 random step operations. The 
autocorrelation time of this MCMC was estimated to be 
less than the time of 1 sweep. 

Concerning the choices for S'jjf N and s!-=l 3et ' N : if 
drawn at random, the fraction of terms set to 1 would 
be approximately equal to that set to when N is large. 
To reduce finite size effects, we force the equality at our 
values of N which are multiples of 4. Because of the per- 
mutation symmetry of the model, one can always per- 
mute the indices so that s[ lnl ^ = 1 for i < N/2 and 
otherwise; furthermore we also impose without loss of 
generality s\ taraet) = 1 for N/4 < i < 37V/4 and oth- 
erwise. Notice that £ 2 s\ lm) = £V sf arget) = N/2 and 

J2 g( ini ) g(target) _ j ^ 

Finally, we need to start the MCMC with a viable 
GRN. To generate an initial genotype, we first set 

Wij = S [ taraet) Sf araet) i,j = l,...,N (4) 

and then construct the bit strings of the binding sites by 
taking values that approximate this equation. In practice 
this initial setting nearly always leads to a viable geno- 
type; if not, other approximations are tried. From this 



procedure an initial viable genotype is constructed and 
then the MCMC can begin. 



IV. RESULTS 
Some qualitative properties 

The total number of genotypes is 4 LN ; since realistic 
values of L are at least 10, this number is astronomical 
even for rather modest N. However only a tiny fraction 
of these genotypes are viable. Naively, since we want the 
gene expression pattern in the steady state to be given 
by s( tar9et >, and since there are 2 N possible patterns, one 
may expect only a fraction of order 2~ N of the genotypes 
to be viable. In fact, the fraction is even smaller, espe- 
cially as h grows. Thus if one seeks to generate random 
viable GRN by producing random genotypes (strings of 
characters for the binding sites), most attempts will be 
unsuccessful and it will be near impossible to sample the 
space of viable GRNs when TV is 3 or more. This is why 
we relied on Monte Carlo Markov Chains to perform the 
sampling of viable networks. (Such an approach is com- 
putationally efficient if the Markov Chain has a short 
auto-correlation time, which is the case here as men- 
tioned in the methods section.) Although it is difficult 
to derive properties of viable genotypes, general geno- 
types are relatively easy to understand because there is 
no viability constraint. The statistical properties of the 
interaction strengths Wij follows from the distribution of 
the mismatch between the character strings for a bind- 
ing site and a TF. Each of the L characters of a binding 
site gives a mismatch with probability 3/4; the total mis- 
match d thus follows the binomial distribution of mean 
3L/4: 

p(d)=( L ^(l/4) L - d (3/4) d (5) 

Sparse essential interactions 

In contrast to general genotypes, viable geno- 
types are subject to the constraint of reaching (af- 
ter some transients) the steady state expression given 

by {S\ tar9et ^}i = \....N when initialized in the state 

{iSj }i=i,...iv ■ How do the set of interactions Wij differ 
when comparing viable and general networks? We first 
address this question computationally by considering sta- 
tistical properties of random genotypes, subject to being 
viable or not. 

The main control parameter is the threshold value h, 
which must be compared to the typical value of • WijSj 
in the absence of the viability constraint; denote this 
value by w. Representing the averages over all genotypes 
by ( }, we have oj = N(Wij)/2. For this formula we have 
used the fact that because of the symmetry of the model 
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FIG. 2: Distribution of the Hamming distance between a TF 
and the receiving DNA site for N = 20, L = 12, £ = 1.5 
and for various values of the threshold parameter: (circle) 
h — 0.3, (square) h = 0.1, (diamond) h = 0.02, (triangle up) 
h = 0.005, (triangle down) h = 0.001. The lines are to guide 
the eye. 



in the absence of the viability constraint, the (Wij) are 
independent of i and j; we have also used (Si) = 1/2 for 
random initial and target expression states. Now when 
h is significantly larger than u, nearly all random geno- 
types will have their expression levels go to at large 
times and thus will not be viable. If a genotype is viable, 



it must be that V ■ WijS, 



(target) 



all i such that S, 



(target) 



is anomalously large for 

1. When this happens for such 
a row i, one may have one entry j* for which Wij* is very 
large, or one may spread out the "burden" among sev- 
eral interactions Wij 1 , Wij 2 , . . . Wij k where each weight is 
a bit larger than average. In the space of all viable geno- 
types, does one more often resort to the first or second 
strategy? To find out, we begin by considering the distri- 
bution of the mismatch between TFs and their binding 
sites in viable networks. We denote the mismatch by d; it 
is perhaps useful to regard the quantity L — dij as a mea- 
sure of the affinity between a TF and the corresponding 
DNA binding site in this model. In Fig. [5] we show the 
distribution of d when there are N — 20 genes, for in- 
creasing values of the threshold parameter h. At the low 
values of h, d has a binomial distribution with a peak 
near d = 3L/4 as expected. However one observes at 
small d significant deviations. In fact, as h increases, the 
viability constraint become marked and the distribution 
becomes bimodal: a peak appears at low mismatch val- 
ues. Note that this peak shifts as h increases, indicating 
that there are nearly perfect matches that appear in that 
regime. 

To distinguish the two scenarios, namely whether the 
burden in each row i is concentrated on one vs. on mul- 
tiple j indices, it is natural to consider the inverse par- 
ticipation ratio (IPR), a commonly used measure of how 
many terms effectively participate in a weighted list. The 
straightforward use of this approach for the matrix {Wij} 
is the IPR I = Y,i. 3 WyAEij W itj ) 2 . If many elements 
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FIG. 3: The mean of the inverse participation ratio per line 
(cf. Eq. [3) versus h for N = 20, L = 12,e = 1.5. The line is 
to guide the eye. 



effectively participate, the value will be 0(l/iV 2 ), while if 
only a few contribute per line the value will be not much 
less than 1/N. However this index is a poor indicator of 
sparseness for several reasons. First, in the absence of the 
viability constraint and when e is of order unity, the value 
is found to be of 0(\)\ The reason can be traced to the 
distribution of the Wij when the mismatches are binomi- 
ally distributed: because the weights Wij are exponential 
in the mismatches, I is often dominated by one or two 
interactions. Second, we are actually more interested in 
what happens for each row i such that g^ taraet% > — the 
other rows don't need good matches and just add noise to 
/. Thus it is appropriate to focus on the IPR restricted 
to one such line at a time. We therefore define 



a _ Q (target) Q (target) j A o T / A\2 



'(dij -3L/4) 
and the associated IPR for the z-th line 



h = 



12 j 



(6) 



(7) 



Then we average these Zj's over z's for which g^'"'^*) = \ 
This is the IPR quantity plotted in Fig. [3l Except at very 
low values of h, only one or two weights per row are sig- 
nificantly larger than the others in that same row. Note 
that the "staircase" structure of the plot is a consequence 
of the discrete nature of the mismatch value. 

It is possible to further explore the statistical proper- 
ties of viable genotypes by considering not the weights 
themselves but their function. One can ask whether "es- 
sential" interactions are sparse (few essential interactions 
per row), i.e., in a given viable genotype, how many of 
its interactions have the property that viability is lost 
when the interaction is removed (Wij is set to 0). We 
find that as soon as h is not too small, there is almost 
always just one essential interaction per gene as shown in 
Fig.llfor N = 20 and L = 12. The same result holds for 
other relevant values of N and L, suggesting that within 
our models, the drive towards sparse interactions arises 
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in regimes of biological relevance. We also considered a 
stronger measure of essentiality: we asked that the viabil- 
ity be lost when the interaction's mismatch is increased 
by one. Remarkably, the rule "one essential interaction 
per gene" generally held here too. Thus mutations in 
these interactions are typically deleterious, while muta- 
tions in the vast majority of the other interactions have 
no consequence on viability. This shows that mutational 
robustness is very heterogeneously distributed among the 
interactions in the network. 



Mathematical analysis in a simple limit 

As already mentioned, under the dynamical process 
Eq. [T], when the expression levels reach the target, the 
dynamics must be at a fixed point if the genotype is vi- 
able. Then the different lines of the matrix {W 7 ^} have to 
satisfy the "fixed point" constraint and in each line it is 
sufficient to consider those elements which are multiplied 
by 1. It is therefore worth considering a toy model where 
transient effects in the dynamics are neglected. In such 
a framework, we consider only the fixed point conditions 
that now can be thought of as coming from the particular 
choice S^ n ^ — g^ target ^ f or eacn j. For each such index i 
(or equivalently line of the matrix {W^}), the toy model 
leads to the partition function 
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-dje 



h) (8) 



where K = N/2 and H is the Heaviside function. We 
have also assumed that i is such that g( target > _ i p or 
the other lines (for which g^ tar s et ) = 0), if /i is not too 
small the fixed point condition will nearly always be sat- 
isfied and so can be ignored. Because in this toy model 
the different lines are independent, we can focus on one 
line at a time, in line with what arises when analyzing 
fixed points in neural network systems [l5j . 

In this reduced problem, the state space is a K- 
dimensional hypercube C with edge length L, C : 
(di,d2, . . . , dx), dj = 0, 1, . . . , L being the jth mismatch. 
The a priori mismatch probability in fact factorizes: 



K 



p{d 1 ,d 2 , . . .,d K ) = Y\p(di 



(9) 



with p(d) given by Eq. [5]. From this we see that 
[z\^y(h)] K gives the fraction of random genotypes that 
are viable. Notice, that for L large and for d small enough 
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< 1, d < L . 



(10) 



We wish to understand the effect of the Heaviside con- 
straint on the probability distribution of the mismatches 
and compare with what happens in the full model. We 
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FIG. 4: Probability distribution of the number of essential 
interactions per row of the matrix specifying a viable network 
for N = 20, L — 12, e = 1.5 and a range of values of h. 



first define dh = df l (e, h) by the equation e~ sdh = h and 
assume that h is such that Eq. [TO] holds for d < dh- 
Now remove from the state space C the sub-hypercube 
C: di > dh for all i = 1, . . . , K. In this reduced state 
space we keep the same probabilities up to a normaliza- 
tion factor: 



p(d-L,d 2 , 
p(di). 



■ ,d K ) 

■ p(d K ) 



n (i - H(d h - c^)) 



/A. (11) 



The factor in brackets is 1 within the reduced state space 
and vanishes outside of it, so it serves simply to filter out 
elements in C . Also, 



K 



A 



1 - 



d\,—,dh 3=1 



JJpid^Hid, - d h ) 



E 



p(m)H(d h - m)) 



K 



(12) 



Eqs. fTTirT!?] are exact. To proceed further we take advan- 
tage of Eq. |10j : in the r.h.s. of Eqs. p^HT2] we expand 
when possible the products and drop all quantities where 
p((i)'s with d < dh appear at higher order than first, e.g., 
. . . p(di)p(dj) . . . with dij < dh- This yields 



p(d 1 ,d 2 



K 

p{d 1 )...p(d K )YH(d h -d j )/K V p(m) (13) 



The marginal distribution of a mismatch, say d\, is ob- 
tained by summing over all di with i > 1. (Note that the 
marginals distributions do not depend on the value of the 
index.) Changing notation d\ — > d one easily obtains 



p(d) 



K -1 
K 



p(d) 



1 p{d)H(d h - d) 

K T, m <d h p( m ) 



(14) 
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h=l/10000 



1 2 e 3 

FIG. 5: The curves show the lower limit of the region where 
h = 10(JV/2)[0.25 + 0.75exp(~e)] L . 

Eq. [14] has a very simple interpretation: in the reduced 
state space, the shape of the probability distribution of 
any mismatch is essentially that without the viability 
constraint, but with an additional peak at small values 
of the mismatch. There is thus one "leading" mismatch 
taking care of most of the constraint, while the other 
mismatches behave approximately as if they were un- 
constrained. Such a behavior is exactly what we saw 
happened in the full model. In brief, the effect of the 
viability constraint condenses on one of the entries of the 
row considered, the other entries behave as if there were 
no constraint. Furthermore, one has sparseness of the es- 
sential interactions and a high IPR. This situation, where 
the IPR goes from low to high values as a parameter (h 
here) is increased, is reminiscent of many "condensation" 
phase transitions. In a biological context, such a tran- 
sition has been observed in another genetic system, but 
based on epistatic interactions [32| and otherwise unre- 
lated to our framework. It has also been seen in statistical 
physics problems [33j . 

In Eq. [T3], the neglected terms are order Kp(d) 
smaller than those kept. Hence to justify dropping those 
terms we must have 

KL dh /4 L ~ KL [ln{x/h)/ s ] /4 L < 1 . (15) 

The constraint imposed so far on the state space (exclu- 
sion of C) leads to simple formulae but it is stronger than 
the one imposed by the toy partition function, namely 

K 

e- sdi > h . (16) 

3=1 

If we replace in this relation the inequality by an equal- 
ity and assume that the dj's are continuous variables, we 
obtain the definition of a hypersurface, call it S, which 
is included in C . From this we see that the constraint 
Eq. [TB] removes not the whole hypercube C but only the 
points lying beyond S. The perturbative relation Eq. |14| 
is nevertheless an excellent approximation provided the 



probabilities associated with those points of C which re- 
main in the state space are very small. This is usually 
the case for values of parameters we consider and that 
are of biological interest. In fact, the quality of the ap- 
proximation could have been guessed given the form of 
the data shown in Fig[2l 



V. DISCUSSION AND CONCLUSIONS 

We considered a fairly general model of a gene regula- 
tory network (GRN) in which function is identified with 
reaching given target gene expression levels. By sim- 
ulation and mathematical analysis, we investigated the 
properties of networks in this model under the constraint 
that they be "viable" (i.e., have the desired function). 
We find that for a certain range of values of the model's 
parameters, the viability constraint leads to sparse GRN; 
we have quantified this through both inverse participa- 
tion ratios of the interactions and through the sparsity of 
"essential" interactions. Interestingly, the effects of the 
viability constraint condense onto just a few of the pos- 
sible binding sites, the others being non- functional. As 
a result, nearly all mutations of the binding sites have 
no effect on the viability and so such sites have a very 
high mutational robustness. However, for those few sites 
which bear the burden of the constraint, the majority of 
mutations are deleterious so their mutational robustness 
is low. Thus in our GRN, the mutational robustness is 
extremely heterogeneous from site to site. In addition, 
any "redundant" interaction is expected to become lost 
under evolutionary dynamics since mutations will remove 
it and condense the burden of viability onto a smaller 
number of interactions. 

Although our modeling of the regulation of gene ex- 
pression is relatively idealized (cf. the simple dynamical 
process Eq. Q]), other features of the model presented 
in this paper are fairly realistic; in particular we have 
insisted on including interactions through the biophysi- 
cal mechanism of molecular recognition and affinity. It is 
therefore interesting that the sparseness of GRNs comes 
out very naturally in this framework. This is well illus- 
trated by our numerics and by the analytic calculation in 
the previous section. It should be clear that this sparse- 
ness is a result of the combined effect of several causes: 
the viability constraint, the low probability of a small 
mismatch between TF and the binding site of DNA, the 
size L of this segment, the not-too-small spacing (in units 
of IcbT) between the energy levels that determine the 
strength of TF-DNA interactions, and finally the value 
of the threshold h itself in the expression dynamics of 
Eq. pQ . A necessary condition for our GRNs to be sparse 
is that this threshold h be significantly larger than the 
total strength contributed by random gene interactions 
in the absence of the viability constraint. For a four let- 
ter code, assuming, as we do, that the binding energy 
is additive in mismatches and that every mismatch costs 
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the same, one gets the condition 

h > — (0.25 + 0. 75t -) 



-e\L 



(17) 



Note that within a two letter code, the condition forces 
one to larger values of L (close to 20) and thus beyond 
what is realistic biologically. 

Since the model parameters correspond to measurable 
quanties, it is appropriate to compare to biological val- 
ues. According to Eq. [2] the probability that a TF 
occupies a DNA site is controlled by n, the number of 
these molecules. Thus, this probability is greater than 
1/2 when the corresponding Boltzmann factor is larger 
than 1/n. Hence, 1/n is an estimate of our threshold h: 
the reasonable range is roughly 1/10000 < h < 1/100. 
Interpreting 3> as "larger by one order of magnitude" , 
i.e., by a factor 10, one gets an allowed region in param- 
eter space as illustrated in Fig. [5j The predicted domain 
of relevance is above the corresponding curves (taken at 
N = 20 and illustrative values of h). We see that e and 
L should not be too small. Moreover, it is gratifying 
that the experimental range of these parameters (indi- 
cated by a rectangle) is near the border and, most of it, 
within this region. The model would remain meaningful 
if L and e were even larger. However, in the analogue of 
Fig. [21 the point at d = would dominate strongly over 
the few neighboring d points, and the system would be 
robust but not evolvable [111 [27j ■ Presumably both to 
reach functional GRN and to allow these to be evolvable, 
it is desirable not to be too dominated by essential in- 
teractions, so the probability of having two co-dominant 
mismatches should not be completely negligible. Such 
effects go beyond our model as we work within a "strong 



selection" limit: a GRN is viable or not, no graduation 
is allowed; when a continuous fitness replaces viability, 
evolvability should be strongly enhanced. 

It is worth emphasizing that as the number N of genes 
grows, it is necessary to increase slowly either L, e or 
h. e is constrained by biophysical processes and thus not 
evolvable, and L seems the best candidate for the system 
to adapt to increasing N 3l| . Note nevertheless that the 
effects of growing N are mild and that in practice regula- 
tion is modular, so effectively biological GRN have only 
modest values of N. Finally, as already mentioned in the 
introduction, our model belongs to the class of threshold 
models that have a much wider applicability than GRN. 
Therefore, the emergence of essential interactions follow- 
ing the mechanisms outlined in this paper is expected in 
all cases where the typical magnitude of the "local field" 
J2i WijSj is small compared to the threshold. 
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